#General imports
import pandas as pd
import seaborn as sns
import numpy as np
import json
import plotly.express as px
#Plotly graph objects
import plotly.graph_objects as go
from plotly.subplots import make_subplots
#Set the default plotly renderer
import plotly.io as pio
pio.renderers.default = 'notebook'
#Sklearn imports
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import TheilSenRegressor
import warnings
warnings.filterwarnings('ignore')
We have selected a dataset on energy usage for buildings in San Francisco, as it allows us to explore the overall trends in San Francisco's building energy usage and the options the city has to further decrease that usage.
The dataset details the energy usage intensity (EUI) of buildings in San Francisco. It contains the year each building was built, its zip code, the property type (for example Commercial, Residential, etc.), and the self-reported property type, which is a more in-depth version of the property type, as it allows the building owner to specify exactly what the building is used for. It also contains the floor area of the building, its benchmark status (whether it has reported its EUI), and finally the EUI itself, i.e. the energy consumed per square foot. The dataset contains EUI from 2012 to 2020 for buildings that have complied with EUI reporting. This means that many buildings in the dataset have missing values in some years due to lack of reporting, while others have not reported any data at all over the period.
We have chosen this dataset to explore the effects that building- and industry-specific data have on the EUI of a building. We hypothesized that building-specific data could affect EUI by describing the physical attributes of the building, which could influence its overall energy usage.
We hypothesized that newer buildings would have observably different EUI, as they should have modernized systems and better insulation.
We also hypothesized that the property types had a large effect on the EUI of a building, as buildings are utilized differently for each type of property use, and we expected to observe noticeable differences between individual property types. By extension, we also believed that we would observe greater EUI reporting compliance from buildings in the commercial industry, as we expected this type of property to face stiffer penalties for not reporting its EUI than, for example, those in the residential category.
We also hypothesized that the self-reported property types could lead us to industries containing buildings with either a particularly high average EUI or outliers that artificially inflated the industry's average EUI. We wanted to explore both scenarios and believed that doing so could suggest potential avenues for the city to reduce its overall building EUI by targeting specific offenders or industries. Furthermore, we hypothesized that we would be able to identify industries with a noticeable decline in average building EUI throughout the reported years, and others that did not seem to follow the overall trends of the city, which could also allow us to propose other ways for the city to reduce its overall building EUI.
We wanted to tell a story about how building EUI in San Francisco changed over the past decade and explore which features possibly affect these changes. We aim to give the reader a thorough understanding of how the different features affect building EUI and to suggest how the city of San Francisco could continue to reduce its building EUI and thereby its emissions. We do this by discussing our hypotheses and the conclusions we draw from them, presenting our findings, and proposing possible ways to decrease the EUI of buildings in San Francisco. The end user should be able to understand our findings through our interactive visualizations and decide for themselves whether they agree with our proposed solutions or whether another solution is more desirable.
From our building emissions dataset, we extracted only the necessary columns, as the dataset contained columns with no relevance to the problem space we have defined.
We also wanted one row per EUI report for a building, instead of one row per building containing many per-year EUI columns, some of which could be N/A. In the wide format, dropping all N/A's would have removed nearly every row, which was not desirable.
We then removed all N/A's and explored the compliance status for the different industries. We also filtered based on compliance, as we are interested in learning more about the emissions of the buildings that comply.
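To illustrate why (using a hypothetical miniature frame, not the real dataset): with one row per building, a blanket `dropna` discards buildings that missed even a single year, while the long format keeps every individual report.

```python
import pandas as pd
import numpy as np

# Hypothetical wide-format frame: one row per building, per-year EUI columns
wide = pd.DataFrame({
    "Parcel(s)": ["A", "B", "C"],
    "2012 Source EUI (kBtu/ft2)": [100.0, np.nan, 90.0],
    "2013 Source EUI (kBtu/ft2)": [np.nan, 150.0, 95.0],
})
# A blanket dropna keeps only buildings with every year reported
n_wide = len(wide.dropna())
# In long format, each report is its own row, so only truly missing reports are lost
long = wide.melt(id_vars="Parcel(s)", value_name="Source EUI (kBtu/ft2)").dropna()
n_long = len(long)
print(n_wide, n_long)  # -> 1 4
```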
building_emissions_url = "https://data.sfgov.org/api/views/j2j3-acqj/rows.csv?accessType=DOWNLOAD"
columns = ["Parcel(s)", "Postal Code", "Floor Area",
           "2020 Benchmark Status", "2019 Benchmark Status", "2018 Benchmark Status",
           "2017 Benchmark Status", "2016 Benchmark Status", "2015 Benchmark Status",
           "2014 Benchmark Status", "2013 Benchmark Status", "2012 Benchmark Status",
           "Year Built", "Property Type", "Property Type - Self Selected",
           "2012 Source EUI (kBtu/ft2)", "2013 Source EUI (kBtu/ft2)", "2014 Source EUI (kBtu/ft2)",
           "2015 Source EUI (kBtu/ft2)", "2016 Source EUI (kBtu/ft2)", "2017 Source EUI (kBtu/ft2)",
           "2018 Source EUI (kBtu/ft2)", "2019 Source EUI (kBtu/ft2)", "2020 Source EUI (kBtu/ft2)"
]
df = pd.read_csv(building_emissions_url,usecols=columns)
dataframes = []
for i in range(2012,2021):
df_tmp = df[["Parcel(s)","Postal Code","Property Type","Property Type - Self Selected",str(i)+" Benchmark Status",str(i)+" Source EUI (kBtu/ft2)","Floor Area","Year Built"]].copy()
df_tmp["Year"] = i
df_tmp.rename(columns={str(i)+" Benchmark Status":"Benchmark Status",str(i)+" Source EUI (kBtu/ft2)":"Source EUI (kBtu/ft2)","Property Type - Self Selected" : "Industry"},inplace=True)
dataframes.append(df_tmp)
dff = pd.concat(dataframes)
dff["Postal Code"] = dff["Postal Code"].astype(str)
A transformation of the data converts the columns with years in their names into two columns: Year and Source EUI.
This leaves up to nine rows (one per year, 2012-2020) for each parcel.
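The same reshape can be sketched with `pd.melt`, extracting the year from each column name (a toy frame, not the real dataset; the loop-and-rename approach above is equivalent):

```python
import pandas as pd

# Toy wide frame with year-prefixed EUI columns, mimicking the real layout
wide = pd.DataFrame({
    "Parcel(s)": ["A", "B"],
    "2012 Source EUI (kBtu/ft2)": [100.0, 200.0],
    "2013 Source EUI (kBtu/ft2)": [110.0, 190.0],
})
long = wide.melt(id_vars="Parcel(s)", value_name="Source EUI (kBtu/ft2)")
long["Year"] = long["variable"].str[:4].astype(int)  # the first four characters are the year
long = long.drop(columns="variable")
print(long.sort_values(["Parcel(s)", "Year"]).to_string(index=False))
```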
First we take a look at the missing values and the type of properties:
dffcp = dff.copy()
dffcp["Missing Data"] = dffcp["Benchmark Status"].isna()
fig = px.histogram(dffcp, x="Property Type",
log_y=True,
color="Missing Data",
title="Percentage of missing values for each property type",
histnorm="percent",
color_discrete_map={"True": "blue", "False": "red"},
width=1000,
height=600)
fig.update_layout(yaxis_title="Percentage of missing values with log Y axis",
showlegend=False, legend_traceorder="reversed")
#fig.write_json("figures/missing_data.json")
fig.show()
The above plot shows the share of missing values for each property type; one color indicates missing values, the other reported values.
We observe that it is primarily the commercial properties that reported their emissions.
Since the commercial property type makes up most of our dataset, we chose to focus on this subset, remove any NaNs, and inspect the distribution of the actual data.
dff = dff.loc[dff["Property Type"] == "Commercial"]
dff.dropna(subset=["Property Type","Benchmark Status"],inplace=True)
fig = px.histogram(dff, x="Benchmark Status", color="Benchmark Status",
title="Distribution of compliance status for commercial properties", width=1000, height=600)
fig.update_layout(yaxis_title="Count")
#fig.write_json("figures/Commercial_benchmark_status.json")
fig.show()
In the above plot we observe the distribution of the benchmark status for building EUI reporting within the commercial property type.
As we have records spanning multiple years, we can learn how well the various industry buildings have been reporting over the past years.
#group by year and benchmark status to count the benchmark for each year
agg_ybs = dff.groupby(["Year","Benchmark Status"])["Benchmark Status"].agg(["count"])
agg_y = dff.groupby(["Year"])["Benchmark Status"].agg(["count"]) #total benchmark recordings for each year
agg_perc = agg_ybs.div(agg_y,level="Year") #percentage of total benchmark recordings for each year and status
dfagg = agg_perc.unstack().unstack().reset_index(name="Percentage").drop(columns=["level_0"]).dropna()
dfagg["Percentage"] = dfagg["Percentage"]*100
#line plot of benchmark status over the years
fig = px.line(dfagg,
x="Year",
y="Percentage",
color="Benchmark Status",
title="Percentage of commercial benchmark over the years",
width=1000,
height=600)
fig.show()
#fig.write_json("figures/Commercial_benchmark_by_year.json")
The above plot shows the distribution of compliance with building EUI reporting throughout the years.
Comparing the percentage distributions, we observe a drop in compliance, with more companies/buildings not reporting their emissions in 2018/2019: a drop of about 18% between 2017 and 2018.
As we are interested in learning more about the energy usage of the ones that complied, we filter for this.
### Filter for compliance
df_compliant = dff[dff["Benchmark Status"] == "Complied"]
#filter for relevant columns
cols = ["Parcel(s)","Postal Code","Industry","Source EUI (kBtu/ft2)","Year","Floor Area","Year Built"]
df_compliant = df_compliant[cols]
#drop nans and sort by year
df_compliant = df_compliant.dropna(subset=["Industry","Source EUI (kBtu/ft2)","Year"])
df_compliant.sort_values(by="Year", ascending=True, inplace=True)
#select only Industry with data for every year from 2012-2020.
property_years = df_compliant[["Industry","Year"]].copy()
property_years.drop_duplicates(inplace=True)
#get number of unique years
n = property_years["Year"].nunique()
#filter for properties with n years
property_years_filtered = property_years.groupby("Industry").filter(lambda x : len(x) == n)
#filter our target dataframe for the properties
df_compliant = df_compliant[df_compliant["Industry"].isin(property_years_filtered["Industry"])]
#a look at the available data
df_compliant.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9279 entries, 55 to 2557
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Parcel(s)              9279 non-null   object
 1   Postal Code            9279 non-null   object
 2   Industry               9279 non-null   object
 3   Source EUI (kBtu/ft2)  9279 non-null   float64
 4   Year                   9279 non-null   int64
 5   Floor Area             9279 non-null   float64
 6   Year Built             9261 non-null   float64
dtypes: float64(3), int64(1), object(3)
memory usage: 579.9+ KB
A total of 9279 rows for analysis, where each parcel has up to nine years (2012-2020) of recorded data. (Some parcels may only have been recorded in recent years.)
As we are looking to reduce building EUI in the future, we are looking for features that can be shown to correlate with building EUI.
Going through the features in our dataset, we identified four candidate features that could be correlated with building EUI.
dfagg = df_compliant.groupby(["Year"])["Source EUI (kBtu/ft2)"].agg(["sum"]).reset_index()
#for each year, the summed EUI (kBtu/ft2) across all buildings
fig = px.line(dfagg,
x="Year",
y=["sum"],
title="Yearly EUI for San Francisco",
width=1000,
height=600)
fig.update_layout(yaxis_title="Total EUI", showlegend=False)
fig.write_json("figures/yearly_total_emissions.json")
fig.show()
We observe a declining trend in total EUI for the city of San Francisco.
#filter for extreme outliers
dfsns = df_compliant.copy()
#Filter out buildings from before 1900 and with source EUI higher than 1000
dfsns = dfsns[dfsns["Year Built"] > 1900]
dfsns = dfsns[dfsns["Source EUI (kBtu/ft2)"] < 1000]
#formatting data types for correlation matrix
dfsns.dropna(inplace=True)
dfsns["Year Built"] = dfsns["Year Built"].astype(int)
dfsns["Postal Code"] = dfsns["Postal Code"].astype("category")
dfsns["Floor Area"] = dfsns["Floor Area"].astype(float)
dfsns["Source EUI (kBtu/ft2)"] = dfsns["Source EUI (kBtu/ft2)"].astype(float)
dfsns["Industry"] = dfsns["Industry"].astype("category")
dfsns["Parcel(s)"] = dfsns["Parcel(s)"].astype("category")
dfsns.drop(columns=["Year"],inplace=True)
#correlation matrix
corr = dfsns.corr().round(2)
fig = px.imshow(corr, color_continuous_scale="RdBu", text_auto=True,
title="Correlation Matrix", width=800, height=600)
fig.write_json("figures/correlation_matrix.json")
fig.show()
While the correlations are low, we still observe a slight correlation between Year Built and Source EUI (0.18) and between Floor Area and Source EUI (0.07).
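Worth noting: pandas `DataFrame.corr()` computes correlations only over numeric columns, so the columns cast to `category` above do not appear in the matrix. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Toy frame with hypothetical values; the Industry column is categorical
df_toy = pd.DataFrame({
    "Year Built": [1950, 1980, 2000, 2015],
    "Source EUI (kBtu/ft2)": [120.0, 150.0, 160.0, 200.0],
    "Industry": pd.Categorical(["Office", "Hotel", "Office", "Retail"]),
})
corr = df_toy.corr(numeric_only=True).round(2)
print(corr)  # only the two numeric columns appear in the matrix
```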
dfagg = df_compliant.groupby(["Year","Industry"])["Source EUI (kBtu/ft2)"].agg(["median","sum"]).reset_index()
dfagg.rename(columns={"sum":"Yearly EUI"},inplace=True)
dfagg.sort_values(by="Yearly EUI", ascending=False, inplace=True)
#NB: sparse data can make some of these results misleading
fig = px.bar(dfagg,
x="Year",
y=["Yearly EUI"],
facet_col='Industry',
title="Yearly EUI by industry",
barmode="group",
labels={
'value': 'Sum',
},
facet_col_wrap=3,
facet_row_spacing=0.05,
facet_col_spacing=0.05,
width=1000,
height=1600)
#Update figure layout
fig.update_yaxes(matches=None, showticklabels=True)
fig.update_layout(showlegend=False)
#Show plot
fig.show()
#fig.write_json("figures/sum_of_eui_by_year_and_industry.json")
We can observe that office space, supermarkets, and hotels dominate total source EUI, which makes sense as our dataset contains many buildings in these industries.
However, this tells only half the story: the need for office space is hard to change, and the median energy usage may be fairly low in comparison to other industries.
Next we look at which industries are responsible for a large portion of the total energy usage, in comparison to the industries with high median energy usage.
dfagg = df_compliant.groupby(["Industry"])["Source EUI (kBtu/ft2)"].agg(["count", "median", "std", "sum"]).reset_index()
ptt_median = dfagg.pivot_table(index="Industry", values=["median"]).unstack().reset_index(name="values").rename(columns={"level_0":"func"})
ptt_sum = dfagg.pivot_table(index="Industry", values=["sum"]).unstack().reset_index(name="values").rename(columns={"level_0":"func"})
fig = make_subplots(
cols = 1, rows = 2,
row_heights = [3.5, 3.5],
vertical_spacing = 0.1,
subplot_titles = ('Source EUI (kBtu/ft2): <b> sum<br /> <br />', 'Source EUI (kBtu/ft2): <b>median<br /> <br />'),
specs = [[{'type': 'treemap', 'rowspan': 1}], [{'type': 'treemap'}]]
)
fig.add_trace(go.Treemap(
branchvalues="total",
labels=ptt_sum["Industry"].tolist(),
parents=ptt_sum["func"].tolist(),
values=ptt_sum["values"].tolist(),
textinfo="label+value+percent parent",
root_color="lightgrey"
),row = 1, col = 1)
fig.add_trace(go.Treemap(
branchvalues="total",
labels=ptt_median["Industry"].tolist(),
parents=ptt_median["func"].tolist(),
values=ptt_median["values"].tolist(),
textinfo = "label+value+percent parent",
root_color="lightgrey"
),row = 2, col = 1)
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25), width=1000, height=1200)
fig.write_json("figures/source_eui_treemap.json")
fig.show()
Interestingly enough, data centers have a very high median EUI, most likely due to the high power consumption of keeping computers running in a small space.
Supermarkets, however, are quite prominent in both treemaps, contributing about 9% of the total EUI while also having the 3rd highest median EUI.
#Calculate count, median, std and sum for each industry
dfagg = df_compliant.groupby(["Industry"])["Source EUI (kBtu/ft2)"].agg(["count", "median", "std", "sum"]).reset_index()
dfagg.sort_values(by=["sum","median"], ascending=[False,False], inplace=True)
#Show median vs sum for each industry
fig = px.bar(dfagg,
x="Industry",
y=["median","sum"],
log_y=True,
title="Comparison of Median and Sum EUI for each industry",
barmode="group",
width=1000,
height=800)
#Update figure layout
fig.update_yaxes(matches=None, showticklabels=True)
fig.update_xaxes(tickangle=45)
fig.update_layout(width=1000, height=600, yaxis_title="EUI (kBtu/ft2)")
#Show plot
fig.write_json("figures/median_eui_by_industry_by_year.json")
fig.show()
A side-by-side log plot comparing the sum and median EUI for each industry in San Francisco. One could argue that this motivates locating new data centers in districts with access to clean energy.
Throughout our data exploration, we have learned that all of our identified features are somewhat correlated with building EUI.
We can therefore utilize them in our data analysis.
In the following section we apply data transformations to gain deeper insight into the development of San Francisco's Source EUI.
We introduce boxplots to learn about the distribution of each industry and its percentage changes in EUI over the years of recorded data.
fig = px.box(df_compliant,
title="EUI by Industry on a logarithmic scale",
x="Industry",
color="Industry",
y="Source EUI (kBtu/ft2)",
log_y=True,
width=1000,
height=800)
fig.update_xaxes(tickangle=45)
fig.update_layout(showlegend=False)
fig.write_json("figures/log_source_eui_by_industry.json")
fig.show()
We immediately observe that the data center industry has a higher median, while the other industries track each other fairly evenly.
Many of the industries have very long whiskers, which may indicate issues with the data or actual high variability beyond the quartiles.
#Calculate mean EUI for each industry and year
agg = df_compliant.groupby(["Industry","Year"]).agg({"Source EUI (kBtu/ft2)":"mean"}).reset_index()
#Calculate percentage change in EUI for each industry and year
agg["pct_change"] = agg.groupby(["Industry"])["Source EUI (kBtu/ft2)"].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
agg = agg.loc[agg["pct_change"] != 0]
#Show percentage change in EUI for each industry and year
fig = px.box(agg,
x='Industry',
y="pct_change",
color="Industry",
title="Yearly industry percentage change in EUI (kBtu/ft2) since 2012")
fig.update_xaxes(tickangle=45)
fig.update_layout(showlegend=False, yaxis_title="Percentage changed", width=1000, height=800)
fig.show()
The boxplot above shows the expanding yearly percentage change since 2012 for each industry.
We observe that the change is large for, for instance, Services and Malls (pink and dark blue), while generally stable for Supermarkets (right dark blue box).
Generally, across the industries, the median is slightly below 0, which indicates a decline in EUI since 2012.
Comparing the two plots, we observe that the data center industry (left purple box) is trending in the wrong direction, as it has both a high median EUI and a positive percentage change in EUI since 2012.
To visualize the change over the years, we animated a plot of the percentage change in EUI since 2012.
pio.renderers.default = 'notebook'
agg.sort_values(by=["Year","pct_change"],ascending=[True,False],inplace=True)
#Show percentage change in EUI for each industry and year
fig = px.bar(agg,
x="Industry",
y="pct_change",
title="Percentage change relative to 2012 for each industry",
color="Industry",
animation_frame="Year",
range_y=[-100, 100],
range_x=[0, len(agg["Industry"].unique())])
fig.update_xaxes(tickangle=40,
title_standoff=0)
fig.update_layout(width=1000,
height=1000,
showlegend=False,
yaxis_title="Percentage changed",)
#Change layout of animation frame slider
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 1500
fig["layout"].sliders[0].pad = {"t": 250, "b": 10}
fig["layout"].updatemenus[0].pad = {"t": 250}
fig.write_json("figures/percentage_change_by_industry_relative_2012.json")
fig.write_html("figures/percentage_change_by_industry_relative_2012.html")
fig.show()
Luckily, we observe that most industries have seen a negative percentage change in EUI relative to 2012, with only two industries (Distribution Center and Automobile Dealership) still showing increased energy usage.
It may seem odd that the data center shows a negative percentage change in 2020 relative to 2012, while the previous boxplot shows a median of about 17%. This is explained by the fact that most, if not all, years other than 2020 had positive percentage changes relative to 2012.
We can quickly verify this by locating the data center rows:
agg.loc[agg["Industry"] == "Data Center"]
|    | Industry | Year | Source EUI (kBtu/ft2) | pct_change |
|---|---|---|---|---|
| 28 | Data Center | 2013 | 2059.300000 | -0.189027 |
| 29 | Data Center | 2014 | 3421.600000 | 65.839473 |
| 30 | Data Center | 2015 | 2570.625000 | 24.594077 |
| 31 | Data Center | 2016 | 2449.700000 | 18.733036 |
| 32 | Data Center | 2017 | 2384.200000 | 15.558356 |
| 33 | Data Center | 2018 | 2749.433333 | 33.260631 |
| 34 | Data Center | 2019 | 2276.566667 | 10.341541 |
| 35 | Data Center | 2020 | 1503.900000 | -27.108375 |
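Using the `pct_change` values from the table above, the roughly 17% median seen in the boxplot can be checked directly:

```python
import statistics

# pct_change values for Data Center, 2013-2020, copied from the table above
pct = [-0.189027, 65.839473, 24.594077, 18.733036,
       15.558356, 33.260631, 10.341541, -27.108375]
print(round(statistics.median(pct), 2))  # -> 17.15
```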
Next, we prepare the data and fit a very simple linear model to predict values for future years.
#Theil-Sen linear model for regression (robust against outliers)
def regress(x, delta):
y = x["Source EUI (kBtu/ft2)"]
X = x[["Year"]].values.reshape(-1, 1)
return np.squeeze(TheilSenRegressor(random_state=42).fit(X, y).predict(delta))
We choose 2021 and 2022 as the years for which we want the linear model to predict values:
yearly_industry_eui_df = df_compliant.groupby(["Industry","Year"]).agg({"Source EUI (kBtu/ft2)":"mean"}).reset_index()
#apply linear regression to group by year and property type
year = [[2021], [2022]]
preds = yearly_industry_eui_df.groupby(["Industry"]).apply(regress,delta=year).reset_index(name="Source EUI (kBtu/ft2)")
preds["Year"] = [[2021, 2022]] * len(preds)
preds = preds.explode(list(["Source EUI (kBtu/ft2)","Year"]))
We add the predicted values to the original data
#concat preds onto yearly_industry_eui_df
ydf = pd.concat([yearly_industry_eui_df, preds])
ydf = ydf.sort_values(by=["Industry", "Year"], ascending=[True, True])
ydf = ydf.reset_index(drop=True)
Compute the percentage change
ydf["pct_change"] = ydf.groupby(["Industry"])["Source EUI (kBtu/ft2)"].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
#plot EUI for each property type over the years using plotly express
ydf_no_pred = ydf[(ydf["Year"] != 2021) & (ydf["Year"] != 2022)]
fig = px.scatter(ydf_no_pred,
x="Year", y="pct_change",
color="Industry",
title="Yearly Percentage change in EUI for each property type relative to 2012 EUI",
labels={
"Year": "Year", "pct_change": "Percentage change in Energy Use Index (kBtu/ft2)"},
trendline="lowess",
height=1000)
fig1 = px.scatter(ydf_no_pred,
x="Year", y="pct_change",
color="Industry",
labels={
"Year": "Year", "pct_change": "Percentage change in Energy Use Index (kBtu/ft2)"},
trendline="lowess",
trendline_scope="overall",
height=1000)
fig1.update_traces(line=dict(width=10, color="green", dash="dash"))
fig1.update_traces(showlegend=True, )
fig_f = go.Figure(data=fig.data + fig1.data)
fig_f.data = [t for t in fig_f.data if t.mode == "lines"]
fig_f.update_layout(height=600, width=1000,
title="Yearly Percentage change in EUI for each property type relative to 2012 EUI",
xaxis_title="Year",
yaxis_title="Percentage change in EUI (kBtu/ft2)")
fig_f.write_json("figures/yearly_percentage_change_in_eui_by_industry.json")
fig_f.show()
The yearly change in EUI for each industry. We can observe that while most industries are clearly trending towards a lower building EUI, as indicated by the negative percentage change, a few, such as distribution centers and automobile dealerships, are not.
#plot EUI for each property type over the years using plotly express
fig = px.scatter(ydf,
x="Year", y="pct_change",
color="Industry",
title="Yearly Percentage change in EUI for each property type relative to 2012 EUI",
labels={"Year":"Year", "pct_change":"Percentage change in Energy Use Index (kBtu/ft2)"},
trendline="lowess",
height=600, width=1000)
fig1 = px.scatter(ydf,
x="Year", y="pct_change",
color="Industry",
labels={"Year": "Year", "pct_change": "Percentage change in Energy Use Index (kBtu/ft2)"},
trendline="lowess",
trendline_scope="overall",
height=600, width=1000)
fig1.update_traces(line=dict(width=10, color="green", dash="dash"))
fig1.update_traces(showlegend=True, )
fig_f = go.Figure(data=fig.data + fig1.data)
fig_f.add_vrect(x0=2020, x1=2022,fillcolor="green", line_color="red", opacity=0.2, line_dash="dot", annotation_font_size = 25, annotation_position="top", annotation_text="Predictions",)
fig_f.add_vline(x=2020, line_width=3, line_dash="dash", line_color="green")
fig_f.data = [t for t in fig_f.data if t.mode == "lines"]
fig_f.update_layout(height=600,
width=1000,
title="Yearly Percentage change in EUI for each property type relative to 2012 EUI",
xaxis_title="Year",
yaxis_title="Percentage change in Energy Use Index (kBtu/ft2)")
fig_f.write_json("figures/yearly_percentage_change_in_eui_by_industry_with_predictions.json")
fig_f.show()
The yearly change in EUI for each industry, with predictions.
We can observe that some industries are predicted to have a worse percentage change than they had in 2020, while others are projected to do better.
We can observe that distribution centers and automobile dealerships remain among the industries with the worst percentage change.
median_df = ydf_no_pred.groupby(["Industry"]).agg({"pct_change":"median"}).reset_index()
median_df.rename(columns={"pct_change":"median"},inplace=True)
top_10 = median_df.sort_values(by="median",ascending=False)[:10]
top_sinners = ydf_no_pred.merge(top_10,on="Industry",how="inner")
industries = top_sinners.pivot_table(index="Year", values="pct_change", columns="Industry").fillna(0)
#plot EUI for each industry over the years using plotly express
fig = px.scatter(top_sinners,
x="Year", y="pct_change",
color="Industry",
title="Yearly Percentage change in EUI for each industry relative to 2012 EUI",
labels={
"Year": "Year", "pct_change": "Percentage change in Energy Use Index (kBtu/ft2)"},
trendline="lowess",
height=600, width=1000)
fig1 = px.scatter(top_sinners,
x="Year", y="pct_change",
color="Industry",
labels={
"Year": "Year", "pct_change": "Percentage change in Energy Use Index (kBtu/ft2)"},
trendline="lowess",
trendline_scope="overall",
height=600, width=1000)
fig1.update_traces(line=dict(width=10, color="green", dash="dash"))
fig_f = go.Figure(data=fig.data + fig1.data)
fig_f.data = [t for t in fig_f.data if t.mode == "lines"]
fig_f.update_layout(height=600, width=1000,
title="Top Yearly Percentage change in EUI for each industry relative to 2012 EUI",
xaxis_title="Year",
yaxis_title="Percentage change in Energy Use Index (kBtu/ft2)")
fig_f.update_traces(showlegend=True)
#fig_f.write_json("figures/top_yearly_percentage_change_in_eui_by_industry.json")
fig_f.show()
The yearly percentage change in EUI for the top 10 industries. We can observe that almost all of the top industries end 2020 with a negative percentage change relative to 2012.
Only automobile dealerships end with a positive percentage change, meaning they have increased their building EUI since 2012.
df_compliant["Postal Code"] = df_compliant["Postal Code"].astype(str)
yearly_industry_eui_df = df_compliant.groupby(["Postal Code","Year"]).agg({"Source EUI (kBtu/ft2)":"mean"}).reset_index()
#apply linear regression to group by year and property type
year = [[2021], [2022]]
preds = yearly_industry_eui_df.groupby(["Postal Code"]).apply(regress,delta=year).reset_index(name="Source EUI (kBtu/ft2)")
preds["Year"] = [[2021, 2022]] * len(preds)
preds = preds.explode(list(["Source EUI (kBtu/ft2)","Year"]))
# preds.sort_values(by=["Year","Postal Code"], ascending=[True,True],inplace=True)
ydf = pd.concat([yearly_industry_eui_df, preds])
ydf = ydf.sort_values(by=["Postal Code", "Year"], ascending=[True, True])
ydf = ydf.reset_index(drop=True)
ydf["pct_change"] = ydf.groupby(["Postal Code"])["Source EUI (kBtu/ft2)"].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
# ydf.loc[ydf["Postal Code"] == "Bank Branch"]
#plot EUI for each industry over the years using plotly express
ydf_no_pred = ydf[(ydf["Year"] != 2021) & (ydf["Year"] != 2022)]
fig = px.scatter(ydf_no_pred,
x="Year", y="pct_change",
color="Postal Code",
title="Yearly Percentage change in EUI for each postal code relative to 2012 EUI",
labels={"Year":"Year", "pct_change":"Percentage change in Energy Use Index (kBtu/ft2)"},
trendline="lowess",
height=600, width=1000)
fig1 = px.scatter(ydf_no_pred,
x="Year", y="pct_change",
color="Postal Code",
labels={"Year": "Year", "pct_change": "Percentage change in Energy Use Index (kBtu/ft2)"},
trendline="lowess",
trendline_scope="overall",
height=600, width=1000)
fig1.update_traces(line=dict(width=10, color="green", dash="dash"))
fig1.update_traces(showlegend=True, )
fig_f = go.Figure(data=fig.data + fig1.data)
fig_f.data = [t for t in fig_f.data if t.mode == "lines"]
fig_f.update_layout(title="Yearly Percentage change in EUI for each postal code relative to 2012 EUI",
height=600, width=1000,
xaxis_title="Year",
yaxis_title="Percentage change in Energy Use Index (kBtu/ft2)")
#fig.write_json("figures/yearly_percentage_change_in_eui_by_industry.json")
fig_f.show()
We observe some of the postal codes of San Francisco with the worst percentage change, especially 94134.
However, they all seem to have a negative slope, which may indicate a lower EUI in the future.
median_df = ydf.groupby(["Postal Code"]).agg({"pct_change":"median"}).reset_index()
median_df.rename(columns={"pct_change":"median"},inplace=True)
top_10 = median_df.sort_values(by="median",ascending=False)[:10]
top_sinners = ydf.merge(top_10,on="Postal Code",how="inner")
# top_sinners
#plot EUI for each industry over the years using plotly express
fig = px.scatter(top_sinners,
x="Year", y="pct_change",
color="Postal Code",
title="Yearly Percentage change in EUI for each postal code relative to 2012 EUI",
labels={
"Year": "Year", "pct_change": "Percentage change in Energy Use Index (kBtu/ft2)"},
trendline="lowess",
height=600, width=1000)
fig1 = px.scatter(top_sinners,
x="Year", y="pct_change",
color="Postal Code",
labels={
"Year": "Year", "pct_change": "Percentage change in Energy Use Index (kBtu/ft2)"},
trendline="lowess",
trendline_scope="overall",
height=600, width=1000)
fig1.update_traces(line=dict(width=10, color="green", dash="dash"))
fig_f = go.Figure(data=fig.data + fig1.data)
fig_f.add_vrect(x0=2020, x1=2022,fillcolor="green", line_color="red", opacity=0.2, line_dash="dot", annotation_font_size = 25, annotation_position="top", annotation_text="Predictions",)
fig_f.add_vline(x=2020, line_width=3, line_dash="dash", line_color="green")
fig_f.data = [t for t in fig_f.data if t.mode == "lines"]
fig_f.update_layout(height=600, width=1000,
title="Top Yearly Percentage change in EUI for each postal code relative to 2012 EUI",)
fig_f.update_traces(showlegend=True, )
fig_f.show()
The predictions seem to indicate that 94134 continues to increase its EUI; however, the model, simple as it is, has not been evaluated or tested, so its predictive capabilities are questionable.
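As a minimal sketch (on synthetic data, not the real dataset) of how the Theil-Sen model could be validated, one could hold out the final years and measure the prediction error:

```python
import numpy as np
from sklearn.linear_model import TheilSenRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
years = np.arange(2012, 2021)
# Synthetic declining EUI series with noise (hypothetical values)
eui = 200 - 3 * (years - 2012) + rng.normal(0, 5, len(years))

train = years < 2019  # fit on 2012-2018, test on 2019-2020
model = TheilSenRegressor(random_state=42).fit(years[train].reshape(-1, 1), eui[train])
mae = mean_absolute_error(eui[~train], model.predict(years[~train].reshape(-1, 1)))
print(f"hold-out MAE: {mae:.2f} kBtu/ft2")
```

The same temporal split could be applied per industry or per postal code before trusting the 2021/2022 extrapolations.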
Loading the geodata:
with open('geodata.json') as f:
geoJson = json.load(f)
codes = df["Postal Code"].unique()
zips = {'type': 'FeatureCollection', 'features': []}
for feature in geoJson['features']:
if (int(feature['properties']['zip_code_5']) in codes):
zips['features'].append(feature)
agg = df_compliant.groupby(["Postal Code","Year"]).agg({"Source EUI (kBtu/ft2)":"mean"}).reset_index()
agg["pct_change"] = agg.groupby(["Postal Code"])["Source EUI (kBtu/ft2)"].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
agg = agg.loc[agg["pct_change"] != 0]
pio.renderers.default = 'notebook'
fig = px.choropleth_mapbox(agg, geojson=geoJson,
                           locations="Postal Code",
                           featureidkey="properties.zip_code_5",
                           center={'lon': -122.44, 'lat': 37.76},
                           zoom=11.5,
                           color="pct_change",
                           color_continuous_scale=px.colors.diverging.RdYlGn[::-1],
                           opacity=0.8,
                           range_color=(-100, 100),
                           mapbox_style="stamen-terrain",
                           title="Source EUI (kBtu/ft2) per year based on postal code",
                           animation_frame='Year')
# Note: update_geos targets layout.geo and has no effect on mapbox-based figures;
# the view is set by center/zoom above
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(legend_title_side='top', width=1000, height=1000)
fig.write_json("figures/source_eui_by_postal_code_animated.json")
fig.write_html("figures/source_eui_by_postal_code_animated.html")
fig.show()
Visualizing our findings in a geographical plot, we observe a general decline in EUI across all of San Francisco, with the exception of 94134.
We have applied a magazine style to our simplest plots. We have done this to reduce the clutter on the page and draw attention to the important parts of the visualization: the plotted data and how they relate to each other.
We have also used animations for some visualizations to create a sense of time for our reader. We have done this for time series data, as we did not believe that a single frame could accurately explain the story we desire.
All visualizations have interactivity to provide more context for interested users.
Highlighting is used: features are distinguished using varying colors and boldness, for instance the regression trendline plotted as an extra-bold, larger dashed line, while coloring helps the user differentiate between categorical features. The rectangle highlighting the predicted values for 2021 and 2022 also fits this genre. Visual structuring is included in the progress/time bars, where we demonstrate yearly percentage changes in EUI.
We aimed for a balanced narrative between an author-driven and a reader-driven story: while a linear path is prescribed, the interactive visualizations the viewer encounters along that path allow them to discover and learn more about the data than is necessarily conveyed in the message.
Choosing the correct visualizations is extremely important, as your illustrations are very much intertwined with the story you wish to tell.
By choosing the correct visualizations, you can allow a casual reader to gather the necessary information at a glance and a curious reader to dig a little deeper into the story.
We have chosen to use many different magazine-style visualizations with interactive elements to lessen the burden on the casual reader, while allowing the curious user to dig a little deeper.
Furthermore, we also utilized animations, which may require a little more attention from the viewer, but allow us to tell a more detailed story.
Our article is also heavily author-driven, as we wanted to tell the story of how San Francisco is doing, as well as how our analysis was built up. A different approach could have been a more reader-driven format, in which interactive figures would allow readers to reach the same conclusions as we did.
We decided to use magazine-style visualizations, which in our case are easy for the reader to understand and digest.
We did this to allow us to use many different visualizations throughout the article, which was needed both to convey the level of detail that we desired and to allow our reader to easily follow the story that we wish to tell.
We wanted to educate our viewer, such that an employee at city hall in San Francisco, with no training in data science or machine learning, should be able to understand our actions.
We decided to use animations towards the end of the article to further engage the viewer in our story.
We recognize that animations require more attention and understanding from the viewer, so we only use these towards the end to ensure that the reader has a basic understanding of data science before they are introduced to the heavier subjects and visualizations.
We believe that our overall recommendation is well supported by the available data. We have used the data to train a machine learning model that can adequately predict the EUI for the coming years for each industry, and we have used this to determine which industries to focus on in the future with regard to reducing building EUI. With that said, we believe that building EUI relies heavily on the activities within the building, and some industries cannot feasibly change much within this spectrum without large monetary investments. As we do not have the necessary context for what can be done to reduce the EUI for each industry, we can only observe what has been done and evaluate upon that. Our suggestions therefore lack some nuance, as we can only specify that certain industries have not followed the trend, not the degree to which they are able to follow it.
Furthermore, we believe that other sources of information could be added to further nuance the analysis we have made throughout this notebook. We have identified a dataset regarding income in San Francisco that we believe could be able to further explain the variance in building EUI between buildings, but this was outside our selected scope for this analysis.
1. Motivation ST119
2. Basic Stats ST94
Part 2.2: Key Points ST94
3. Data Analysis ST203
6. Discussion ST119